1. Introduction


It is well-known that the microstructure of cementitious materials dictates the properties and 
performance of the material. The microstructure in turn is a function of time, processing 
techniques, as well as the constituent material properties and their proportions. Cement paste 
microstructures are generally constituted of solid and pore phases; the influence of porosity on 
the mechanical properties and durability of concrete has been well-elucidated for many decades. 
The solid phase generally consists of cement hydration products and unhydrated/unreacted 
materials, which depend on the water-to-binder ratio (w/b) and the reactivity of the starting 
materials [1-4]. While in well-hydrated plain ordinary Portland cement (OPC) pastes, C-S-H gel,
calcium hydroxide (CH), and unhydrated clinker are invariably the only solid phases present, 
multi-component blends like ultra-high performance (UHP) cementitious pastes contain different 
types of C-S-H based on their density (e.g., low-density or LD, high-density or HD), ultra-high 
stiffness phases, mixed reaction products, and unreacted particles of cement, fly ash, and 
limestone [5-8]. Thus, the microstructural complexity increases with the use of multiple-blend
binders, requiring more sophisticated and refined methods for microstructural characterization 
and analysis. 


Typically, scanning electron microscopy (SEM) coupled with energy-dispersive X-ray
spectroscopy (EDS) is used to extract the chemical information of the microstructure in
cement-based materials [9-12]. Grid nanoindentation on these microstructures provides the
nanomechanical properties (or more accurately, micromechanical properties, since the region of
influence of the indents is of the order of 1-3 µm) [13-15]. Coupling nanoindentation data (i.e.,
modulus and/or hardness of the indented locations) with SEM-EDS-based microstructural
chemical mapping (i.e., intensity of species such as Ca, Si, and Al) has been shown to provide
much-needed microscale chemistry-property relationships for cement-based materials [5,16,17].
Clustering algorithms such as k-means clustering or those based on Bayesian methods have been 
used in conjunction with nanoindentation and chemical maps of cement pastes [18-20]. The 
microscale properties thus obtained are upscaled using analytical or numerical tools to predict 
the bulk properties of the material such as elastic modulus, which are important in design 
[19,21,22]. 


Grid nanoindentation and chemical mapping produce large datasets, which when judiciously 
combined with machine learning (ML), enable the development of unbiased structure-property 
estimators. The use of ML to relate the properties of cement-based materials to the mixture 
proportions [23-27], or to a limited extent, to their constitutive phases [20,28] has been reported.
A recent work by the authors demonstrated the use of ML to predict the nanoindentation modulus 
of different phases in UHP cementitious pastes using the intensity of chemical species at 
indentation locations as inputs [29]. It was shown that the efficiency of predicting the modulus 
suffers when the microstructure becomes more complex. In addition, acquisition of
nanoindentation data can be time- and cost-prohibitive. Thus, a ML-based classification approach
is adopted in this work. If ML models can be trained on elemental maps from SEM-EDS and
corresponding nanoindentation data to classify locations in a SEM image as belonging to the
appropriate microstructural phase (e.g., LD or HD C-S-H, unhydrated clinker, etc.), this would
facilitate real-time characterization. In this paper, the focus is on using SEM-EDS information (with or
without nanoindentation data) to identify the constitutive phases as labeled from clustering 
analysis of nanoindentation data and chemical intensities. This allows for very quick first-order 
determinations of the effective material properties. Artificial Neural Networks (ANN) and 
hierarchical decision trees are the ML approaches adopted in this study. The classification 
models are implemented on two UHP cement pastes, whose properties have been extensively 
reported [30,31], and validated on two other cement pastes whose characteristics are adopted 
from the literature [32,33]. 


2. Data and organization 


2.1. UHP cement pastes


Nanoindentation and SEM-EDS chemical data utilized in this study belong to two UHP 
cementitious pastes (referred to as UHP-1 and UHP-2 in Table 1) which have been studied in 
detail by the authors [15,16,30,31]. As mentioned earlier, this dataset has been used in predicting 
the indentation modulus from chemical species intensities using ML [29]. Both UHPs contain 
multiple cement replacement materials (Class F fly ash, silica fume, fine limestone powder) of 
varying sizes and reactivity, and a low water-to-binder ratio (w/b), as shown in Table 1. Further 
details on chemical characteristics of the raw materials, mixture proportions, and mixing and 
curing conditions can be found in [15,16]. The paste mixtures were cured in moist conditions
until the age of testing.


Table 1
Proportions (mass-based) of the UHP cementitious pastes employed in this study.

           Constituent mass fraction in the binder
Mixture    OPC     Fly ash   Silica fume   Limestone   w/b     Curing regime
UHP-1      0.70    0.175     0.075         0.05        0.20    Moist curing, 30d, 90d
UHP-2      0.50    --        0.20          0.30        0.20    Moist curing, 30d


2.2. Nanoindentation and chemical mapping 


A brief description of the procedure for nanoindentation and chemical mapping of the UHP-1 and
UHP-2 pastes is given here. The sample preparation included specimen cutting,
ultrasonication in isopropyl alcohol (IPA) [34], and polishing [35-37]. Nanoindentation was 
carried out using an Ultra Nanoindentation Tester (UNHT®; Anton Paar). Each sample had at 
least 1250 indents split among several grids in different locations to capture the heterogeneity in 
the microstructure of multi-component UHP paste systems. Indentations were performed in force 
control mode with a maximum displacement cutoff of 250 nm (0.25 µm), with the loading profile
detailed in [15,16]. This depth corresponded to an interaction volume idealized as a hemisphere
with a radius 3 to 5 times the maximum cutoff [5,38,39]. The hardness (H) and the effective
Young’s modulus (M) were determined following the Oliver and Pharr method [40,41].


The specimen surfaces were imaged after the nanoindentation tests using a SEM (SNE-4500M
Plus) coupled with EDS (Bruker EDS with ESPRIT software). The application of SEM-EDS for 
compositional identification of cement hydration phases is discussed in [12,42]. Back-scattered 
electron (BSE) mode imaging (Fig. 1(a)) was performed with a beam current of 110 pA, a
working distance of ~10 mm, and an accelerating voltage of 15 kV [16]. A BSE image was taken
over the grid before EDS was performed at 50 kcps. It has been shown that, in cementitious
materials, most of the characteristic X-rays escaping the material are generated within a depth of
2 µm [5,32], which is in line with the interaction depth for nanoindentation. To relate the
elemental EDS information to the nanomechanical data, a MATLAB localization algorithm was 
implemented to align the optical image of the nanoindentation grid to the EDS chemical maps, as 
detailed in [16,18,29]. Brightness of the EDS chemical maps was auto-scaled by the data- 
collection software. Fig. 1(b) shows the Ca EDS map. Al, Si, and Fe maps were similarly 
obtained. Across different maps, the number of X-ray counts associated with the same brightness 
value varies, and hence EDS maps are qualitative measures of the concentration of elements in 
each indentation grid. For statistical analysis, the RGB intensities from the Al, Ca, Fe, and Si
EDS maps (denoted as I_Al, I_Ca, I_Fe, and I_Si, respectively) were matched with the corresponding
nanomechanical data. Fig. 1(c) illustrates the translation of EDS map color intensity of Ca to the
0-255 scale. In BSE imaging, the cube of the brightness (y³) can be related to the density of the
phase [9]. This local density information is also used as an input parameter in the ML models
described later.
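

The matched data at each indent can be pictured as one record per indentation point. The following is a minimal sketch only: the class name, the field names, and the numerical values are hypothetical, the BSE brightness is assumed to be a 0-255 gray level, and the snippet is not the MATLAB localization workflow used in the study.

```python
from dataclasses import dataclass


@dataclass
class IndentRecord:
    """One indentation point; all field names are hypothetical."""
    i_al: int        # Al EDS map intensity on the 0-255 scale
    i_ca: int        # Ca EDS map intensity on the 0-255 scale
    i_fe: int        # Fe EDS map intensity on the 0-255 scale
    i_si: int        # Si EDS map intensity on the 0-255 scale
    brightness: int  # BSE gray level (assumed to lie on a 0-255 scale)
    modulus: float   # indentation modulus M (GPa)
    hardness: float  # indentation hardness H (GPa)

    def features(self, include_mechanical=True):
        """Return chemical intensities and cubed brightness, optionally with M and H."""
        chem = [self.i_al, self.i_ca, self.i_fe, self.i_si, self.brightness ** 3]
        if include_mechanical:
            return chem + [self.modulus, self.hardness]
        return chem


# Hypothetical values chosen only for illustration.
point = IndentRecord(i_al=49, i_ca=159, i_fe=81, i_si=57,
                     brightness=130, modulus=56.9, hardness=3.45)
with_mechanical = point.features()
chemistry_only = point.features(include_mechanical=False)
```

The cubed brightness is stored as a plain feature next to the four intensities, so that the same record can feed a model with or without the nanomechanical values.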


Fig. 1. (a) BSE image of the 30-day UHP-1 paste, (b) Ca EDS map with blue dots added to show the 
location of one of the indentation grids after the alignment procedure, and (c) MATLAB graphic 
translating EDS map color intensity into 0-255 scale for Ca. 


2.3. Statistical cluster analysis from SEM-EDS and nanoindentation data 


To generate the labels of the constitutive microstructural phases to train the ML classification 
models, a Bayesian Information Criterion (BIC) with negative log likelihood method was 
implemented for statistical deconvolution (clustering) of the chemical intensities and the 
micromechanical properties [18]. If there exist n phases in the microstructure, with each phase
occupying a volume fraction of $\phi_i$ (i = 1...n) such that $\sum_{i=1}^{n} \phi_i = 1$, the properties of each phase
can be approximated by a Gaussian distribution with a probability density function (PDF) given
as:


$\mathrm{PDF} = \sum_{i=1}^{n} \phi_i \, y_i$   (1)


Here, $y_i$ is the vector of classification variables of the phase. The classification variables utilized
in cluster analysis were indentation modulus M, indentation hardness H, and the normalized
intensities of aluminum I_Al, calcium I_Ca, iron I_Fe, and silicon I_Si. While the same statistical
nanoindentation results can be fit using different numbers of phases and volume fractions [43], a
maximum negative log likelihood estimation was used to find the PDFs that best represented the
experimental data:


$\mathrm{NlogL} = -\max\left(\log\left(\prod_{i} \mathrm{PDF}(n_i)\right)\right)$   (2)


Here, $n_i$ represents the distribution parameters (in the case of a Gaussian distribution, the mean
and standard deviation) that are iterated to maximize the likelihood function. Then, the BIC was
minimized such that:


$\mathrm{BIC} = 2\,\mathrm{NlogL} + p \log(m)$   (3)


In the above equation, m is the number of indentation points and p is the number of identifying 
parameters available at each indentation point (in this case six; four chemical intensities and two 
mechanical properties M and H) [18]. A summary of the constitutive phases identified from this 
clustering analysis is given in Table 2. They include low density (LD) C-S-H, high density (HD) 
C-S-H, an ultra-high stiffness (UHS) phase unique to the very low w/b cement pastes such as 
UHP mixtures, a mixed phase comprised of partially reacted starting materials such as fly ash or 
limestone and products such as carboaluminates, and residual clinker. The salient features of 
these phases have been elucidated in detail elsewhere [5,16,17,44,45]. As an example for the 
UHP-1 paste cured for 90 days, the clustering of M and H is shown in Fig. 2(a), while Fig. 2(b) 
depicts the normalized intensities of Ca at every indentation point and the corresponding M, and 
Fig. 2(c) showcases the normalized intensities of Ca vs. Si. A detailed analysis of the UHP paste
clusters identified, and the justification for their corresponding constitutive phase labels, are
described in [16].
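

To make the role of the negative log likelihood and the BIC concrete, the sketch below evaluates both for a one-dimensional Gaussian mixture with fixed parameters. It is an illustration under stated assumptions, not the clustering code of [18]: the moduli are made-up numbers, the mixture parameters are supplied by hand instead of being fitted, and the parameter count p passed to `bic` is chosen per candidate for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_nlogl(data, fractions, means, stds):
    """Negative log likelihood of the data under a 1-D Gaussian mixture.

    `fractions` plays the role of the phase volume fractions and must sum to 1.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("volume fractions must sum to 1")
    total = 0.0
    for x in data:
        density = sum(f * gaussian_pdf(x, mu, s)
                      for f, mu, s in zip(fractions, means, stds))
        total += math.log(density)
    return -total


def bic(nlogl, p, m):
    """BIC = 2 * NlogL + p * log(m), following the form of Eq. (3)."""
    return 2.0 * nlogl + p * math.log(m)


# Hypothetical indentation moduli (GPa) forming two well-separated groups.
moduli = [28.0, 30.0, 31.0, 29.0, 32.0, 98.0, 101.0, 100.0, 103.0, 99.0]

one_phase = mixture_nlogl(moduli, [1.0], [65.1], [35.2])
two_phase = mixture_nlogl(moduli, [0.5, 0.5], [30.0, 100.2], [1.5, 1.8])

# p is set per candidate here (an assumption of this sketch).
bic_one = bic(one_phase, p=2, m=len(moduli))
bic_two = bic(two_phase, p=5, m=len(moduli))
```

For these made-up data the two-phase description gives the lower BIC, which is the sense in which the criterion is minimized to select the clustering.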


Table 2
Constitutive phases identified and their volume fractions (φ) in the UHP pastes. FA, MS, L, and CA
denote fly ash, microsilica, limestone, and carboaluminates, respectively. The phase labels from 0-4 are
the inputs to the ML classification algorithm.

                                                 Volume fraction (φ)
Mixture   Phase                    Phase label   30d     90d
UHP-1     LD CSH/Residual MS       0             0.18    -
          HD CSH                   1             0.38    0.40
          UHS Phase                2             0.19    0.23
          Mixed (FA, L, MS, CA)    3             0.12    0.17
          Clinker                  4             0.13    0.20
UHP-2     UHS Phase/CSH            2             0.42    -
          Mixed (L, MS)            3             0.41    -
          Clinker/Unreacted        4             0.17    -


Fig. 2. Clustering analysis of the 90-day cured UHP-1 paste: (a) M vs. H, (b) M vs. I_Ca, and (c) I_Ca vs. I_Si.


2.4. Inputs to the machine learning classification model and the rationale 


The ML classification models described in the forthcoming section use the intensities at
different indentation points to determine which of the phases (shown in Table 2) the point
belongs to. The datasets for both mixtures and ages shown in Table 1 were combined to create
the most generalizable ML classifier possible. The details of this large dataset are shown in Table
3. The ability of ML algorithms to accurately classify the constitutive phases in complex
microstructures belonging to multiple mixtures at different ages is explored. For the first set of
ML models, 7 inputs were used (i.e., the 4 chemical intensities, y³, and M and H values from
nanoindentation). In the second set, the ML models were trained using only 5 inputs (i.e., the 4
chemical intensities and y³). M and H are used as inputs in one set of ML models since the actual
nanomechanical information is expected to help the ML models learn to identify
the phases during the training stage. This is shown to be true later in this paper, especially for
more complex microstructures such as the UHP pastes. To test the correlation between the
predicted phase labels and the 7 inputs, Pearson correlation coefficients (or linear correlation
coefficients) [20,25] were determined as shown in Fig. 3. It can be noticed that all the inputs are
reasonably correlated to the phase label output. M and H have the greatest correlation with the
phase labels, and all the chemical intensities are quite similarly related to the phase label output.
The high correlation between the phase label output and M and H means that the efficiency of
ML classification models that use only chemical intensities from SEM-EDS (which is the
preferred approach, since this data is easier to obtain than M and H) could suffer, which is
evaluated in this paper. Generating ML models with and without nanoindentation data provides
quantification of the tradeoff of only including SEM-EDS data as inputs.


Table 3
Details of the input dataset for the ML models, including y³, H, M, and RGB intensities of Al, Ca, Fe, and
Si.

Dataset: combined dataset belonging to UHP-1 @ 30d, 90d and UHP-2 @ 30d (see Table 1 for mixture
details). No. of data points: 7476.

Statistic   I_Al   I_Ca   I_Fe   I_Si   y³            H (GPa)   M (GPa)
Max         252    252    252    252    1.47 x 10^7   23.20     235.54
Mean        49     159    81     57     2.18 x 10^6   3.45      56.92
Min         4      4      4      4      6.85 x 10^3   0.43      12.87


Fig. 3. Pearson coefficient heat map for the correlation between the 7 inputs and the phase label output. 
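

Each cell of a heat map such as Fig. 3 is a Pearson (linear) correlation coefficient between two columns of the dataset. A minimal sketch of that calculation is given below; the function name and the sample values are illustrative assumptions and are not taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical phase labels and indentation moduli (GPa) for six points.
labels = [0, 1, 2, 3, 4, 4]
moduli = [22.0, 35.0, 47.0, 70.0, 105.0, 98.0]
r_label_modulus = pearson(labels, moduli)
```

A coefficient close to +1 or -1 indicates a strong linear relation between an input and the phase label; values near 0 indicate little linear relation, although a nonlinear dependence may still exist.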


3. Machine learning and data processing 


The different machine learning (ML) techniques used for classification, along with the data pre- 
processing and parameter optimization methods, are summarized here. 


3.1. Machine learning techniques 


Artificial neural network (ANN) and forest ensemble methods are the ML algorithms used for
the multi-class classification (i.e., more than 2 classes or phase labels) reported in this paper. ANNs
can learn very complex patterns of data, and thus are a preferred ML algorithm for many
materials-related problems [23,26,46-48]. The ANNs used in this study utilize 2 to 3 hidden
layers, which are appropriate for the number of unique data records used. The chosen activation
function to relate neurons [46] is the rectified linear unit (ReLU), with optimization performed
using RMSprop, which features an adaptive learning rate formula [49]. Backpropagation, using
the gradient of the previous iteration to train the weights of the ANN, was performed
automatically by the Keras neural network framework, written in Python, which was used to build
and train the ANNs [50]. To minimize over-fitting, a dropout rate, i.e., the probability that any neuron and its
connections will be temporarily excluded from the network, was incorporated into the ANN [51].
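

The two ingredients named above, the ReLU activation and dropout, can be illustrated with a few lines of plain Python. This is a sketch for intuition only: the study relied on Keras, whose internals are not reproduced here, and the weights, inputs, and dropout rate below are arbitrary illustrative values.

```python
import random


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate` during
    training and rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)
hidden = relu(dense([0.2, 0.8], [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.1]))
hidden_train = dropout(hidden, rate=0.2, rng=rng)                   # training pass
hidden_eval = dropout(hidden, rate=0.2, rng=rng, training=False)    # inference pass
```

Neurons are only dropped during training; at inference the layer passes its activations through unchanged.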


8 E. Ford et al./ Journal of Soft Computing in Civil Engineering 5-4 (2021) 01-20 


Machine learning forest ensemble methods are based on the structure of a decision tree that finds
logical splits in the data leading from one branch to the next until ending at the leaf node [23,52].
To reduce prediction inaccuracy and over-fitting, the predictions from a collection of decision
trees are bagged [23,53]; such collections are termed ensembles. A basic form of forest ensemble is the Random
Forest (RF) method in which the best split of the data is determined by considering all of the
input features and checking a criterion, such as Gini impurity, to select the most discriminative
threshold [52,54]. Each individual decision tree in the RF ensemble does not use the entire set of
training data, but a bootstrap sample made from subsets of the training data with replacement
[52,53]. Another forest ensemble is the Extra Trees (ET) regressor in which the splits are drawn
at random for each feature and the best split, as measured by the chosen criterion, is selected as the
splitting rule [52,54]. In the ET regression model, the entire dataset is incorporated into each
individual tree [54]. The prediction results of the individual trees are averaged to produce the
output prediction in the RF and ET regressions. In a Gradient Boosted Tree (GBT) ensemble, an
initial tree is trained with the entire dataset. All subsequent trees in the forest are trained to
minimize the residual between the predicted and actual values of the previous tree [23,54,55].
The final prediction is calculated as the weighted sum of the predictions of each tree. For each
tree beyond the first, the prediction is multiplied by the learning rate, with typical values between
0.01 and 0.1 [23,54]. A specialized form of the GBT is the Extreme Gradient Boosted (XGB) tree
[55]. XGB applies shrinkage and column subsampling techniques to prevent overfitting
between boosted trees and additionally offers scalability through parallel tree boosting (efficient
computing regardless of data size) [55].
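

The residual-fitting idea behind gradient boosting can be sketched with one-split trees (stumps) on a toy regression problem. The code below is an assumption-laden illustration and not the GBT or XGB implementation used in the study: it handles a single input feature, uses squared error, and the data are invented.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (stump) by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides of the split non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees, learning_rate=0.1):
    """Return a list of (weight, stump); every tree after the first is
    fitted to the current residuals and scaled by the learning rate."""
    ensemble = []
    predictions = [0.0] * len(xs)
    for i in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        weight = 1.0 if i == 0 else learning_rate
        ensemble.append((weight, stump))
        predictions = [p + weight * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return ensemble


def predict(ensemble, x):
    """Weighted sum of the individual tree predictions."""
    return sum(weight * stump_predict(stump, x) for weight, stump in ensemble)


def mse(ensemble, xs, ys):
    return sum((y - predict(ensemble, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 2.0, 2.0, 5.0, 5.0, 9.0, 9.0]
error_one_tree = mse(boost(xs, ys, n_trees=1), xs, ys)
error_fifty_trees = mse(boost(xs, ys, n_trees=50), xs, ys)
```

With a small learning rate each added tree corrects only part of the remaining residual, which is why many trees are typically combined.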


3.2. Preprocessing and evaluation 


The input data points were pre-processed before separation into the testing and training sets to
ensure that all the inputs and outputs lie in the range [0, 1] such that:


$z_{new} = \frac{z - z_{min}}{z_{max} - z_{min}}$   (4)


Here, $z_{new}$ is the value of the variable after transformation, z is the current value of the variable,
and $z_{min}$ and $z_{max}$ are the minimum and maximum values, respectively, of that variable.
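

The min-max transformation of Eq. (4) can be written in a few lines. The sketch below is illustrative: the function name and the sample intensities are assumptions, and a constant column is rejected because the denominator would be zero.

```python
def min_max_scale(values):
    """Scale a column of values to [0, 1] using its own minimum and maximum."""
    z_min, z_max = min(values), max(values)
    if z_max == z_min:
        raise ValueError("a constant column cannot be min-max scaled")
    return [(z - z_min) / (z_max - z_min) for z in values]


# Hypothetical Ca intensities on the 0-255 scale.
ca_intensities = [4, 128, 252]
scaled = min_max_scale(ca_intensities)
```

Each input column is scaled independently, so an intensity column and the modulus column end up on the same [0, 1] range despite their different units.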


The dataset mentioned in Table 3 was shuffled along the rows of indentation points (Fig. 1(b)) 
such that adjacent points were separated, providing a greater chance of equal distribution of the 
various microstructural entities within the testing and training datasets. Training was performed 
by fitting the ML algorithm to the training dataset and allowing the algorithm to adjust its 
internal features to minimize the error. Model performance was evaluated using the testing 
dataset, which the ML algorithm has not yet seen, and measuring the resulting errors. To evaluate 
the accuracy of the ML predictions, a stratified n-fold cross-validation technique was employed
[23,25,52]. Stratified splitting refers to preserving the percentage of samples in each class within
each fold [54]. A 3-fold cross-validation, deemed sufficient for the size of the datasets, was
performed using the following steps: (i) randomizing the dataset and performing a 3-fold
stratified split, (ii) training the model using 2 of the folds, (iii) testing the model using the
remaining fold, (iv) repeating steps (ii) and (iii) until each fold has been used for testing once,
acquiring 3 independent performance measures, and (v) averaging the individual metrics
measured to obtain the cross-validation value.
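

Steps (i) to (v) can be sketched without any ML library as follows. This is a minimal illustration under assumptions: `score_fold` is a hypothetical callback standing in for "train on these indices, test on those, return a metric", and the label counts are invented.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=3, seed=0):
    """Split sample indices into k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                  # step (i): randomize within each class
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def cross_validate(labels, score_fold, k=3, seed=0):
    """Steps (ii)-(v): train on k-1 folds, test on the held-out fold, average."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(score_fold(train_idx, test_idx))
    return sum(scores) / len(scores)


# Hypothetical phase labels: 6 points of class 0, 9 of class 1, 3 of class 2.
labels = [0] * 6 + [1] * 9 + [2] * 3
folds = stratified_folds(labels, k=3)


def held_out_share(train_idx, test_idx):
    # Placeholder metric: the share of the data held out for testing.
    return len(test_idx) / (len(train_idx) + len(test_idx))


mean_score = cross_validate(labels, held_out_share)
```

In practice the callback would fit one of the classifiers on the training indices and return, for example, its ROC-AUC on the test indices.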


Among the several assessment methods for ML-based classification [56], the area under the
Receiver Operating Characteristic curve (ROC-AUC) is chosen here since it is an important
metric for checking any classification model’s performance [56-58]. A ROC-AUC of 1.0
indicates perfect classification. The ROC curve is created by plotting the true positive rate
(also called sensitivity or recall) against the false positive rate (also called false alarm rate or
fallout) at various threshold settings [56,57]. The ROC-AUC is a measure of how well a model
can discriminate between two classes (or microstructural phase labels, in this case), and is
insensitive to the changes in the class distribution [56,57]. In the case of multi-class labeling,
however, ROC-AUC can be calculated using two different methods. The One-versus-Rest (OvR)
strategy calculates the model’s ability to discriminate between one class vs. the rest of the
classes, while the One-versus-One (OvO) strategy pairs each class against another such that, for
n phase labels, $n(n-1)/2$ calculations are made [58]. The former is sensitive to class distribution
changes [57] while the latter is insensitive to class distribution, but computationally more
expensive when the class number increases. In this study, with 5 phase labels and data that is not
significantly imbalanced (significant imbalance would require special class distribution
considerations [59]), the more general OvR method was employed. The multi-class dataset was
one-hot encoded (i.e., represented as binary vectors), and a ML classifier was trained to predict the
probability that a data point belonged to each phase label. The phase label with the highest
probability is taken as the prediction for each point. In training, the goal of the ML models was to
maximize the objective function, which was the ROC-AUC. Other metrics tracked, but not used to
train the models, were the accuracy and the F1 score, given as [56]:


$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$   (5)


$F1 = \frac{2TP}{2TP + FP + FN}$   (6)
where TP is the number of true positives, TN is the number of true negatives, FP is the number 
of false positives, and FN is the number of false negatives, predicted by the model for each class 
(or phase label). True positives indicate the success in identification of the correct phase label. 
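

The metrics above can be reproduced on a toy example. The sketch below is illustrative only: the probabilities and labels are invented, the function names are assumptions, and the AUC is computed from its rank interpretation (the chance that a randomly chosen positive point receives a higher score than a randomly chosen negative one, with ties counted as one half) in place of tracing the ROC curve explicitly.

```python
def argmax(row):
    """Index of the largest probability, i.e., the predicted phase label."""
    return max(range(len(row)), key=row.__getitem__)


def ovr_counts(y_true, y_pred, positive):
    """TP, TN, FP, FN when `positive` is treated as one class vs. the rest."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)          # Eq. (5)


def f1_score(tp, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0   # Eq. (6)


def binary_auc(scores, is_positive):
    """Rank-based AUC for one class vs. the rest."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_auc(y_true, probabilities, classes):
    """Unweighted mean of the One-versus-Rest AUCs over all classes."""
    aucs = [binary_auc([row[c] for row in probabilities],
                       [truth == c for truth in y_true]) for c in classes]
    return sum(aucs) / len(aucs)


# Hypothetical predicted probabilities for three phase labels.
y_true = [0, 0, 1, 1, 2, 2]
probabilities = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.2, 0.7, 0.1],
                 [0.1, 0.6, 0.3], [0.1, 0.2, 0.7], [0.5, 0.1, 0.4]]
y_pred = [argmax(row) for row in probabilities]

tp, tn, fp, fn = ovr_counts(y_true, y_pred, positive=1)
acc_class_1 = accuracy(tp, tn, fp, fn)
f1_class_1 = f1_score(tp, fp, fn)
macro_auc = ovr_auc(y_true, probabilities, classes=[0, 1, 2])
```

In this toy case two of the six points are misclassified, yet the averaged OvR ROC-AUC remains high, because the AUC rewards correct ranking of the probabilities even when the top-ranked label is wrong. For five phase labels, the OvO alternative would require 5(5-1)/2 = 10 pairwise comparisons.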


3.3. Hyperparameter optimization 


For all the models, the parameters which maximized the 3-fold cross-validation ROC-AUC were 
used as the basis for the final models, with some additional fine-tuning. The parameters to 
optimize in the ANN models were the number of hidden layers, the number of neurons in each 
hidden layer, and the dropout rate. The ReLU activation function, a learning rate of 0.001, and an
RMSprop optimization scheme were used. For the RF, ET, and GBT models, the number of trees
in the forest, the maximum depth of the trees, the minimum number of samples before splitting, 
and the minimum number of samples per leaf were tuned. Coarse optimization of the 
hyperparameters for ANN and the forest ensembles followed a random search pattern, found to 
be the most efficient method to optimize parameters [60], by randomly generating 20 different
combinations of hyperparameters. The hyperparameters for random testing were chosen from the
uniform distributions shown in Table 4.
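

The coarse random search can be sketched as below for the ANN ranges of Table 4. This is an illustration under assumptions: the dictionary keys are hypothetical names, layer and neuron counts are assumed to be drawn as integers, and `evaluate` stands in for the 3-fold cross-validation ROC-AUC of a model trained with the sampled configuration.

```python
import random


def sample_ann_config(rng):
    """Draw one ANN configuration from the ANN ranges listed in Table 4."""
    return {
        "hidden_layers": rng.randint(1, 4),        # inclusive integer range
        "starting_neurons": rng.randint(20, 75),   # inclusive integer range
        "dropout_rate": rng.uniform(0.0, 0.3),
    }


def random_search(evaluate, n_trials=20, seed=0):
    """Evaluate n_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_ann_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_objective(config):
    # Stand-in for the cross-validated ROC-AUC; peaks at a dropout rate of 0.1.
    return 1.0 - abs(config["dropout_rate"] - 0.1)


best_config, best_score = random_search(toy_objective, n_trials=20, seed=0)
```

The best configuration found this way would then serve as the starting point for the additional fine-tuning mentioned in the text.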


For the XGB models, there are many hyperparameters available to tune, nine of which were 
chosen for this study. The hyperparameters range from structure-based, such as the depth of the 
trees or the number of GBTs, to how splits are made via the subsample and colsample_bytree 
parameters, or even how big the leaf groups can be via min_child_weight. Additional parameters 
tuned included the learning rate, the minimum objective function loss required to split a leaf 
node called gamma, as well as the L1 and L2 regularization terms on the weights called alpha 
and lambda, respectively. Each hyperparameter was tested one at a time over a grid within the 
range of values indicated in Table 4, where the best value was used when searching for the next 
parameter. The order of hyperparameter selection is given by the order of parameters in Table 4 
for XGB. This process was continued until the end, when several different learning rates and
numbers of trees were tested as a final tuning effort. A detailed breakdown of the allowed ranges
and significance of each of these hyperparameters is given in the XGB code documentation
[Sah 
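

The one-parameter-at-a-time search described for XGB can be sketched as follows. The grid values are hypothetical points inside the Table 4 ranges, only three of the nine hyperparameters are shown, and `toy_objective` is a stand-in for the cross-validated ROC-AUC; none of this reproduces the actual tuning runs.

```python
def sequential_grid_search(evaluate, grids, start):
    """Tune one hyperparameter at a time, in the order given by `grids`,
    carrying the best value found so far into the next search."""
    best = dict(start)
    for name, candidates in grids.items():      # dicts preserve insertion order
        scored = [(evaluate({**best, name: value}), value) for value in candidates]
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best


# Hypothetical grids within the XGB ranges of Table 4.
grids = {
    "max_depth": [1, 3, 5, 7, 9],
    "min_child_weight": [1, 2, 4, 6],
    "gamma": [0.0, 0.2, 0.4, 0.8],
}
start = {"max_depth": 1, "min_child_weight": 1, "gamma": 0.0}


def toy_objective(params):
    # Stand-in score with a known optimum at (5, 2, 0.2).
    return -((params["max_depth"] - 5) ** 2
             + (params["min_child_weight"] - 2) ** 2
             + (params["gamma"] - 0.2) ** 2)


tuned = sequential_grid_search(toy_objective, grids, start)
```

This greedy scheme needs far fewer evaluations than a full grid over all combinations, at the cost of possibly missing interactions between hyperparameters.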


Table 4
Hyperparameters tuned based on a uniform distribution range of potential values.

Model                               Hyperparameter                       Uniform distribution range
ANN                                 # hidden layers                      [1, 4]
                                    # starting neurons                   [20, 75]
                                    Drop rate                            [0, 0.3]
Random Forest (RF), Extra Trees     # of trees                           [50, 400]
(ET) Forest, and Gradient Boosted   Maximum depth                        [3, 21]
Trees (GBT)                         Minimum # of samples before split    [2, 25]
                                    Minimum # of samples on leaf         [1, 10]
XGB                                 # of trees                           [0, 500]
                                    Maximum depth                        [1, 9]
                                    min_child_weight                     [1, 6]
                                    Gamma                                [0, 0.8]
                                    Subsample                            [0.5, 1.0]
                                    Colsample_bytree                     [0.2, 1.0]
                                    Alpha                                [1E-5, 1]
                                    Lambda                               [1E-5, 1.05]
                                    Learning Rate                        [0.05, 0.3]


4. ML-based classification of cementitious phases


4.1. UHP pastes 


The predictive efficiency of the different ML models using SEM-EDS data with and without 
nanoindentation hardness (H) and stiffness (M) as inputs, to classify the UHP phase at each 
desired location is reported in this section. Each of the five ML algorithms (ANN, RF, ET, GBT,
XGB) discussed above was implemented on the data to examine the applicability of the ML
classification methodology to identify the phase labels in complex and heterogeneous UHP
pastes. Table 5 lists the ROC-AUC, accuracy, and F1 values for the final ML classification
models for the 7-input and 5-input cases. The bolded entries indicate the ML models where the
OvR ROC-AUC results from 3-fold cross-validation were the highest. Note that the 3-fold
cross-validation trials could not be plotted directly; instead, Fig. 4 and Fig. 5 were generated from a
75%/25% data split, in which 75% of the data points were used for training and 25% were used
for testing and displaying the plots. The results of this split were almost identical to the 3-fold
cross-validation results reported in Table 5.


In the case of the datasets with 7 inputs (both SEM-EDS and nanomechanical data), all the ML 
models performed very well in terms of all three metrics (ROC-AUC, Accuracy, and F1), with 
the GBT model showing a slightly better performance. The ROC-AUC value was around 0.99 
(1.0 being the absolute best) [56,57], indicating the efficiency of the classification algorithms in 
being able to determine the phase labels based on the given input data. Even when the 
nanomechanical data was removed from the datasets and the input matrix reduced to 5 SEM- 
EDS input parameters, the ML classification algorithms worked quite well with a ROC-AUC 
value of around 0.92. In this case, the ANN model provided the best ROC-AUC value, while the 
forest models also showed very similar performance. The high ROC-AUC values show that, in 
an OvR setting, the classification ML algorithms used are successful in distinguishing one class 
compared to all the others. However, it can be also seen from Table 5 that there is a sharp 
reduction in the accuracy and F1 scores, which both depend on the number of correctly identified 
data points as described using Equations 5 and 6 [56], when the nanomechanical information is 
absent. This is to be expected, since M and H had the highest correlation with the output phase 
label, as indicated in Fig. 3. It is observed that high accuracy and F1 values, along with high 
ROC-AUC, can be achieved when additional, relevant input data such as M and H are available. 


Fig. 4(a) and (b) show the ROC curves obtained from these best-performing models for the 7- 
input and 5-input cases, respectively. As expected, and shown in Table 5, the ROC curves shift 
downward when the nanomechanical inputs are excluded from the ML classification analysis. 
However, it is important to note that not including M and H, which correlated the most with the 
phase label output (see Fig. 3), still produces reasonable identification of the microstructural 
phases just based on SEM-EDS information. This is significant in that, the use of SEM-EDS 
chemical maps along with a ML classification scheme allows for: (i) identification of potential 
phases present at those locations, which provides detailed insights into the influence of material 
composition on microstructure, and (ii) prediction of important paste properties (such as 
modulus) based on the known properties of the phases and their volume fractions.


Table 5
Efficiency metrics of the ML classification algorithms for phases in UHP pastes from SEM-EDS (5
inputs), and with two additional inputs, M and H, from nanoindentation (7 inputs). Average and standard
deviation from 3-fold cross-validation are reported. The ML model with the greatest ROC-AUC for each
number of inputs is shown in bold.

No. of inputs   Model type               ROC-AUC          Accuracy         F1
7               ANN                      0.988 ± 0.003    0.906 ± 0.009    0.912 ± 0.009
                Random Forest            0.988 ± 0.003    0.903 ± 0.010    0.911 ± 0.010
                Extra Trees Forest       0.986 ± 0.003    0.889 ± 0.015    0.898 ± 0.014
                Gradient Boosted Trees   0.989 ± 0.002    0.908 ± 0.011    0.914 ± 0.012
                XGB                      0.988 ± 0.003    0.907 ± 0.017    0.914 ± 0.016
5               ANN                      0.926 ± 0.002    0.726 ± 0.013    0.745 ± 0.014
                Random Forest            0.924 ± 0.002    0.728 ± 0.010    0.749 ± 0.012
                Extra Trees Forest       0.924 ± 0.003    0.715 ± 0.012    0.729 ± 0.013
                Gradient Boosted Trees   0.919 ± 0.004    0.719 ± 0.015    0.746 ± 0.016
                XGB                      0.921 ± 0.003    0.721 ± 0.008    0.743 ± 0.010


[Fig. 4 plots: True Positive Rate vs. False Positive Rate, both axes from 0.0 to 1.0. ROC areas in the
legends: (a) LD CSH 1.00, HD CSH 0.99, UHS 0.98, Mixed 0.98, Clinker 0.98, average 0.99; (b) LD CSH
1.00, HD CSH 0.94, UHS 0.90, Mixed 0.91, Clinker 0.88, average 0.93.]


Fig. 4. Receiver-Operator Curves (ROC) showing One-versus-Rest results for ML classification using
25% of data for testing: (a) GBT ML model with 7 inputs, (b) ANN ML model with 5 inputs. The dashed
diagonal line represents the random guess of a class. The chosen models are the best performing ones
based on Table 5.


Further information on the predictive performance of the classification models can be gleaned 
from confusion matrices presented in Fig. 5(a) and (b) for the 7-input GBT ML and 5-input ANN 
ML classification models, respectively. For both input types, the LD C-S-H phase is
accurately identified in 94-100% of the points by the ML models as shown in Fig. 5. Similarly, 
the HD C-S-H phase is correctly classified in 83-94% of the points, depending on whether the 5- 
input or 7-input models are used. Since LD C-S-H and HD C-S-H have differences in their 
packing densities, which result in different mechanical properties [61,62], it is only natural that a 
ML model that is trained using nanomechanical data also shows near-perfect capability in 
accurately identifying these phases. However, cluster analysis in several past works [14,17,63]
has shown dissimilarities in chemical intensities between these phases, which enables the
5-input model also to perform satisfactorily in classifying these phases. As indicated in the authors’
recent work [15,16], the remaining three hard-stiff phases, viz., UHS, mixed phase containing 
limestone, carboaluminates and fly ash, and clinker, with indentation moduli of ~43 GPa [61], 
~75 GPa [21,44,64,65], and ~100 GPa [47] respectively, overlap in terms of chemical intensities 
and stiffnesses. This is clearly noticed in the scatter of points corresponding to these phases in 
Fig. 2(c). Reducing the number of inputs from 7 to 5 clearly has a significant adverse effect on 
the classification of these phases as noted from Fig. 5. In the 7-input model, the mixed phase is 
correctly identified in ~92% of the cases, while the classification accuracy drops down to ~63% 
in the 5-input model, where the mixed phase is confused with the UHS phase in many instances. 
In both the models, clinker classification has the lowest accuracy. In the 5-input model, the 
clinker classification accuracy is around 50%, with a significant number of clinker locations mis- 
identified as HD C-S-H phases due to the absence of corroborating nanoindentation data. It is 
also notable that the EDS chemical maps were obtained based on qualitative measurements and 
not on quantitative spot chemical analysis [18,32], and therefore only provide relative atomic 
ratios and not the exact ratios. As such, it is likely that cementitious phases with similar Ca/Si 
ratios, but different stiffnesses may be confused for one another in the 5-input ML model. 
Another explanation for the confusion between the clinker and HD C-S-H phases is that, in the 
UHP-2 mix, no HD C-S-H cluster was identified and the reaction product belonged to the 
UHS phase [16]; however, when the Ca and Si intensities were plotted for the clinker and UHS 
phases, they overlapped almost perfectly [16]. The high unreacted limestone content in this 
mixture could have resulted in excess Ca in the chemical map, contributing to a higher Ca/Si 
ratio, which is typical of clinker. This may have made it difficult for the ML model to 
differentiate between the clinker and UHS/HD C-S-H phases for the UHP-2 mixture. It is once again 
shown that, in complex microstructures where chemical intensities overlap between phases (as 
shown in Fig. 2(c)), the use of additional inputs in the form of nanomechanical properties helps 
classification significantly. 


Fig. 5. Confusion matrices showing results for ML classification using 25% of data for testing: (a) GBT 
ML model with 7 inputs, and (b) ANN ML model with 5 inputs. Percentage accuracy in each row is given 
based on the total number of data points in each phase label, as shown along the diagonal. Ideally, the 
classification accuracy is near 100% in the boxes along the diagonal, which corresponds to little or no 
misidentification and thus values close to 0% in all the other boxes. 


4.2. Validation of the classification approach using other cement paste data 


To validate the ML classification of microstructural phases through chemical intensities from 
SEM-EDS, two new datasets were curated from literature [32,33] and similar ML models were 
developed to classify their phases. These datasets are referred to as OPC (plain cement paste) 
[32] and NP (20% of cement by mass replaced with a natural pozzolan) [33]. Nanoindentation 
and chemical mapping data reported in [32,33] identified several clusters of microstructural 
phases in these mixtures. For the OPC data, 5 clusters were identified by the BIC and negative 
log-likelihood method in [32]; however, the two clusters with the highest stiffness and hardness 
were grouped together as part of the clinker phase so that the same ML algorithms as described 
above could be used here. The remaining three clusters were labeled as LD C-S-H, HD C-S-H, and a 
mixed phase. For the NP data, 6 clusters were identified in [33], but clusters 5 and 6 were 
grouped together as they were both identified as clinker phases [33], with LD C-S-H, HD C-S-H, 
UHS, and mixed phase labels given to the remaining clusters. The only available inputs were 
three elemental intensities, I_Ca, I_Si, and I_Al, along with M and H. Thus, ML models using all 
5 inputs, or just the 3 chemical signature inputs, were implemented. To keep the discussion 
succinct, only three forest ensemble models (RF, ET, and GBT, which are generally the faster 
ML models) are used here for the validation tests. Table 6 lists the resulting ROC-AUC, 
accuracy, and F1 values for these datasets. Similar to the UHP pastes, there is a decrease across 
all classification metrics going from 5 inputs (which include the micromechanical M and H) 
to 3 inputs. However, this decrease is much smaller, owing to the greatly reduced complexity of 
these well-hydrated microstructures. Compared to the UHP pastes evaluated in the previous 
section, these pastes demonstrate reduced heterogeneity, with fewer starting ingredients, 
proportioned using a higher w/b, and having undergone higher degrees of reaction, which 
influences the predictive accuracy as detailed in [29]. Based on the results in Table 6, the chosen 
ML classification methods can be considered successful in identifying the constituent 
phases, given only the chemical intensities, for less complex microstructures. 
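

The 5-input versus 3-input comparison summarized in Table 6 can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (columns 0 to 2 stand in for I_Ca, I_Si, and I_Al, and columns 3 and 4 for M and H), the forest models use default scikit-learn settings, and the One-versus-Rest ROC-AUC and weighted F1 scorers are assumed choices, since the averaging scheme is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic 4-phase stand-in dataset (assumption, NOT the OPC or NP data).
X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# Hypothetical column layout: 0-2 chemical intensities, 3-4 modulus and hardness.
feature_sets = {"5 inputs": [0, 1, 2, 3, 4], "3 inputs": [0, 1, 2]}

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
    "Gradient Boosted": GradientBoostingClassifier(random_state=0),
}

# Assumed multi-class scorers; the paper's exact averaging is not specified.
scoring = {"roc_auc": "roc_auc_ovr", "accuracy": "accuracy", "f1": "f1_weighted"}

# 3-fold cross-validation, as in the Table 6 caption.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = {}
for set_name, cols in feature_sets.items():
    for model_name, model in models.items():
        scores = cross_validate(model, X[:, cols], y, cv=cv, scoring=scoring)
        # Store (mean, standard deviation) over the three folds for each metric.
        results[(set_name, model_name)] = {
            m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std())
            for m in scoring
        }

for (set_name, model_name), metrics in results.items():
    row = "  ".join(f"{m}={mean:.3f}±{std:.3f}" for m, (mean, std) in metrics.items())
    print(f"{set_name:9s} {model_name:17s} {row}")
```

Dropping the last two columns mimics the removal of the nanoindentation inputs; on real data, the size of the resulting drop in each metric is what distinguishes simple from complex microstructures.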


Table 6 

Efficiency metrics of the ML classification algorithm for phases in OPC and NP pastes from SEM-EDS (3 
inputs), and with two additional inputs, M and H, from nanoindentation (5 inputs). Average and standard 
deviation from 3-fold cross-validation is reported. The most accurate ML model for each number of 
inputs is shown in bold. 


Dataset  Inputs  Model Type               ROC-AUC        Accuracy       F1 
OPC      5       Random Forest            0.975 ± 0.010  0.888 ± 0.018  0.893 ± 0.018 
OPC      5       Extra Trees Forest       0.981 ± 0.005  0.897 ± 0.011  0.899 ± 0.011 
OPC      5       Gradient Boosted Forest  0.968 ± 0.011  0.858 ± 0.043  0.867 ± 0.037 
OPC      3       Random Forest            0.951 ± 0.011  0.808 ± 0.062  0.801 ± 0.064 
OPC      3       Extra Trees Forest       0.958 ± 0.012  0.829 ± 0.040  0.837 ± 0.037 
OPC      3       Gradient Boosted Forest  0.945 ± 0.017  0.817 ± 0.040  0.827 ± 0.031 
NP       5       Random Forest            0.988 ± 0.006  0.891 ± 0.023  0.891 ± 0.019 
NP       5       Extra Trees Forest       0.989 ± 0.007  0.902 ± 0.038  0.897 ± 0.039 
NP       5       Gradient Boosted Forest  0.991 ± 0.006  0.925 ± 0.026  0.925 ± 0.022 
NP       3       Random Forest            0.973 ± 0.008  0.860 ± 0.019  0.859 ± 0.022 
NP       3       Extra Trees Forest       0.973 ± 0.006  0.832 ± 0.041  0.826 ± 0.039 
NP       3       Gradient Boosted Forest  0.965 ± 0.006  0.822 ± 0.028  0.819 ± 0.032 


Fig. 6 shows the confusion matrices and ROC curves for the OPC and NP mixtures for the 3-input 
cases. The classification accuracy is very high, as noted from the confusion matrices for 
both pastes, attributable to the relative simplicity of their microstructures compared to the 
UHP pastes. There are very few mis-labeled indentation points even when the nanomechanical 
data are not provided. The results demonstrate the applicability of ML-based classification 
algorithms in labeling the microstructural phases in cementitious systems. 


Fig. 6. ML classification using 25% of data for testing: (a) and (b) confusion matrix and ROC curves for 
the 3-input ET model for the OPC paste; (c) and (d) confusion matrix and ROC curves for the 3-input RF 
model for the NP paste. 
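

The row-wise percentages reported in the confusion matrices of Figs. 5 and 6 can be computed as in the following minimal sketch. The true and predicted labels are hypothetical values chosen for illustration, not results from this study, and scikit-learn is an assumed tool.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

PHASES = ["LD C-S-H", "HD C-S-H", "Mixed", "Clinker"]

# Hypothetical held-out labels and predictions (illustrative only).
y_true = ["LD C-S-H"] * 10 + ["HD C-S-H"] * 8 + ["Mixed"] * 6 + ["Clinker"] * 4
y_pred = (["LD C-S-H"] * 10                     # all LD C-S-H points correct
          + ["HD C-S-H"] * 7 + ["Mixed"] * 1    # one HD C-S-H point mislabeled
          + ["Mixed"] * 5 + ["HD C-S-H"] * 1    # one mixed-phase point mislabeled
          + ["Clinker"] * 3 + ["HD C-S-H"] * 1) # one clinker point mislabeled

# Rows are true phases, columns are predicted phases.
counts = confusion_matrix(y_true, y_pred, labels=PHASES)

# Normalize each row by the number of points carrying that true label, so the
# diagonal gives the percentage of each phase that is correctly identified.
row_pct = 100.0 * counts / counts.sum(axis=1, keepdims=True)
per_phase_accuracy = np.diag(row_pct)

for name, row in zip(PHASES, row_pct):
    print(f"{name:9s}", "  ".join(f"{v:5.1f}" for v in row))
```

With this normalization every row sums to 100%, so an off-diagonal entry directly gives the share of a phase that is mistaken for another one, such as clinker points labeled as HD C-S-H.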


5. Summary and Conclusions 


This study has presented a novel approach to accurately predict cement hydration phases from 
chemical intensity maps using ML methods. Chemical intensity data from SEM-EDS for 
different UHP cement paste datasets representing multiple cementing materials and hydration 
ages were combined. Micromechanical information from nanoindentation as well as the elemental 
intensities from qualitative EDS maps were then coupled with Bayesian statistical clustering. 
With the phase labels (e.g., LD or HD C-S-H, clinker, etc.) thus identified, different ML 
classification techniques based on Artificial Neural Networks (ANN) and forest ensemble 
methods were implemented on the dataset. The area under the Receiver Operating Characteristic 
curve (ROC-AUC) was chosen as the indicator of model performance. 


16 E. Ford et al./ Journal of Soft Computing in Civil Engineering 5-4 (2021) 01-20 


It was observed that, for the combined dataset of the UHP pastes, the removal of nanoindentation 
information from the datasets reduced the efficiency of classification. Confusion matrices 
demonstrated that the removal of nanoindentation information resulted in misidentification of 
some of the microstructural labels, especially where the chemical intensity data overlapped 
between multiple phases due to the unique composition of the UHP pastes. It was shown that, in 
such complex systems, the use of additional inputs in the form of nanomechanical properties 
helps classification significantly. The same approach was also used on two less complex 
microstructures (i.e., fewer starting materials and more complete hydration), one of a plain OPC 
paste and the other of a paste with 20% OPC replaced by a highly reactive natural pozzolan. 
Here, normalized intensities of just the three chemical species (Ca, Si, and Al) were deemed 
sufficient (without nanoindentation data) to generate a highly accurate classifier. It is shown that 
chemical intensity mapping of microstructures, coupled with machine learning, can be used to 
accurately classify the microstructural phases (in the case of common cementitious 
microstructures), which can lead to a priori property (e.g., stiffness) predictions. ML models can 
thus classify the cementitious component phase at locations in a microstructure to facilitate 
real-time characterization and first-order estimation of bulk properties. 